An experiment in fuzzy matching, using SQL, with CockroachDB

A recent tweet inspired me to address the need for fuzzy matching by combining some existing capabilities of CockroachDB. Note the key features mentioned in the tweet:

  • “similar but not equal sporting events names”: a common pattern. Users tend to mistype data in input fields, and data isn’t always correct. Nevertheless, we’d like to return the closest match.
  • “I’d rather use this in-built feature than pay for a whole ES cluster with added maintenance overhead to boot”: this is the second time I’ve heard this sentiment in the past couple of months. Elasticsearch (ES) is a full-featured search engine and delivers a great experience but, for this purpose, it would be overkill and would require additional time and expense to deploy and operate.

What I found interesting about this exercise is that the underlying pattern, configuring a changefeed to route events through an external system and then back into the database, has enormous potential far beyond the fuzzy matching example you’ll read about below. Development practices have evolved away from burying logic in the database and toward coding it in the languages most applicable to the problem; the emergence of microservices aligns with this trend. That observation, combined with the new ability to define a changefeed on a specific column family, has me convinced we’ll see some very interesting applications of this pattern with CockroachDB.

Now I’ll get off of my soapbox and into the demo!

[Image: the tweet asking about fuzzy matching]

Demo Background

To begin with, CockroachDB isn’t a fork of PostgreSQL, so you don’t simply “bolt on” the usual Postgres extensions such as pg_trgm, the module mentioned in the tweet. But this is a small obstacle and, in fact, offers the opportunity to demonstrate an up-and-coming feature of CockroachDB Enterprise Changefeeds, aka “CDC”: the ability to configure a changefeed on a specific column family. This feature is available in the v22.1.0-beta.1 version I am using here – this release should be generally available by mid-May 2022.

The fuzzy matching experiment

Desired state

We INSERT, UPDATE, and DELETE data about sports teams and would like to retrieve this data even when queries misspell names. I confess to having misread the tweet, interpreting it as pertaining to sports team names as opposed to sporting event names, so what I show here uses team name data; it should be easy to apply to event names as well, so long as you have access to that data.

If we envision that this requirement arises from an application’s need to perform this type of matching, we can build a simple REST app which will:

  1. Provide a webhook endpoint to which CockroachDB changefeeds will send events
  2. Provide a REST endpoint for search, returning a ranked list of the top-N closest team name matches

DDL for the table

Here’s the DDL for the table. The FAMILY … parts of this enable the changefeed to generate events based solely on FAMILY f1, ignoring the resulting changes to FAMILY f2, which is what we need:

CREATE TABLE teams
(
  id UUID PRIMARY KEY DEFAULT GEN_RANDOM_UUID()
  , name TEXT NOT NULL
  , grams TEXT[]
  , FAMILY f1 (id, name)
  , FAMILY f2 (grams)
);

And here’s the inverted index on the grams column:

CREATE INDEX ON teams USING GIN (grams);

REST app: CDC / indexing

Now, we need the REST app. I use Python for this since it’s concise and easy to get up and running, and the Flask module works very well for REST apps. And I had some code fragments lying around which were easy to reuse. The entire Python script is here, but I’ll focus on the interesting aspects below.

  • Generating n-grams from strings:
def get_ngrams(s, n=3):
  # Slide an n-character window across s, collecting each n-gram
  return [s[i:i+n] for i in range(len(s) - n + 1)]

For the input la galaxy, the return value is ['la ', 'a g', ' ga', 'gal', 'ala', 'lax', 'axy']. The default value of n is 3, which aligns with the way pg_trgm works.

A couple of things to point out:

  1. The text is lowercased by a separate Python function (a sketch of it follows this list). This is common in information retrieval applications, since comparisons are typically done without regard to case.
  2. The n-grams show the result of sliding a window across the string, from left to right, yielding 3-character sequences which can span the space between words. Because the n-grams cross word boundaries, adjacent words factor into the score, so phrases are scored appropriately.
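
Here is a minimal sketch of that tokenize() helper, assuming it simply lowercases the input and delegates to get_ngrams (the version in trigrams.py may do more, such as stripping punctuation):

def tokenize(s, n=3):
  # Lowercase, then generate the n-gram list
  return get_ngrams(s.lower(), n)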
  • Webhook endpoint for the changefeed:
@app.route('/cdc', methods = ['POST', 'GET'])
def cdc_webhook():
  obj = request.get_json(force=True)
  print("CDC: " + json.dumps(obj)) # DEBUG
  for o in obj["payload"]:
    if o["after"] is None:
      pass # o["after"] is None for a DELETE: nothing to index
    else:
      pk = o["after"]["id"]
      name = o["after"]["name"]
      index_string(pk, name)
  return "OK", 200

The changefeed will send an HTTP POST to this /cdc endpoint, as configured via the SQL statement shown below. When that happens, JSON is accessible by the call to request.get_json(force=True), and that JSON contains the current values of the id and name columns from the teams table.
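
For reference, the body of one of these POSTs looks roughly like the following (the values here are illustrative, and the exact envelope depends on the changefeed options; note that "after" carries only the id and name columns because the changefeed is scoped to FAMILY f1):

{
  "payload": [
    {
      "after": {
        "id": "15e240a7-d1db-4b77-b454-c895a11610bf",
        "name": "LA Galaxy"
      },
      "key": ["15e240a7-d1db-4b77-b454-c895a11610bf"],
      "topic": "movr.public.teams",
      "updated": "1652110000000000000.0000000000"
    }
  ],
  "length": 1
}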

  • Those values are passed to the index_string(pk, name) function, which just updates the grams column of the teams table:
def index_string(pk, content):
  ng = tokenize(content)
  stmt = text("UPDATE teams SET grams = :grams WHERE id = :pk").bindparams(grams=ng, pk=pk)
  run_statement(stmt)

I won’t get into the details of the run_statement(stmt) function here; a stripped-down sketch follows, and interested readers can find the full version in the code itself.
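
A minimal stand-in for run_statement(), assuming a SQLAlchemy engine built from the DB_CONN_STR environment variable (the parameter names here are my guesses; the real version in the repo also handles transaction retries):

import os
from sqlalchemy import create_engine

engine = create_engine(os.environ["DB_CONN_STR"])

def run_statement(stmt, fetch_rows=False, quiet=True):
  # engine.begin() opens a transaction and commits on successful exit
  with engine.begin() as conn:
    rs = conn.execute(stmt)
    return rs.fetchall() if fetch_rows else None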

  • Here is the SQL statement we need to run to configure the changefeed, tying all of this together:
CREATE CHANGEFEED FOR TABLE teams FAMILY "f1"
INTO 'webhook-https://localhost:18080/cdc?insecure_tls_skip_verify=true'
WITH updated, full_table_name, topic_in_value;

Note that this operation requires an Enterprise License, but is also available in the single node “demo” mode (see below).

REST app: search / fuzzy matching

A REST client can retrieve fuzzy matches for a given team name by doing something equivalent to this (this example was run on my MacBook; the base64 command works differently here than it does on Linux; pretty_print_json.py is here):

$ name="PA Galuxy"; time curl -k -s https://localhost:18080/search/$( echo -n $name | base64 )/5 | pretty_print_json.py
[
  {
    "name": "LA Galaxy",
    "pk": "15e240a7-d1db-4b77-b454-c895a11610bf",
    "score": "42.8571"
  },
  {
    "name": "LA Galaxy II",
    "pk": "85b5b97a-9a6e-4cef-b63c-7cbe123eca07",
    "score": "10.7143"
  },
  {
    "name": "LA Giltinis",
    "pk": "82b96763-6836-4ab8-84ad-1864a1f3e16d",
    "score": "4.7619"
  },
  {
    "name": "Tampa Mayhem",
    "pk": "6f5fd5e0-6654-4385-9bd0-c191f4f1c5b4",
    "score": "3.5714"
  },
  {
    "name": "Tampa Tarpons",
    "pk": "1851c8e9-77e8-456b-a197-6aaed971942a",
    "score": "2.8571"
  }
]

real 0m0.196s
user 0m0.063s
sys 0m0.043s

Before getting into the Python code for the /search endpoint, the following are worthy of mention:

  1. I deliberately misspelled the name of the team; “LA Galaxy” was the one I wanted.
  2. Even though the initial character, the ‘P’ in “PA Galuxy”, was incorrect, the results were correct. This is one of the essential features of n-gram based matching – you don’t rely on the leading characters in the string to match.

On to the Python part of this interaction:

@app.route("/search/<q_base_64>/<int:limit>")
def do_search(q_base_64, limit):
  query_str = decode(q_base_64)
  logging.info("Query: {}".format(query_str))
  ng = tokenize(query_str)
  logging.info("Query (n-grams): {}".format(ng))
  sql = """
  WITH qbool AS
  (
    SELECT id, grams, 1 + ABS(ARRAY_LENGTH(grams, 1) - ARRAY_LENGTH(CAST(:ngrams AS TEXT[]), 1)) delta
    FROM teams
    WHERE grams && CAST(:ngrams AS TEXT[])
  ), qscore AS
  (
    SELECT id, COUNT(*) n FROM
    (
      SELECT id, UNNEST(grams) FROM qbool
      INTERSECT
      SELECT id, UNNEST(CAST(:ngrams AS TEXT[])) FROM qbool
    )
    GROUP BY id
  )
  SELECT qbool.id, t.name, 100*n/delta score
  FROM qbool, qscore, teams t
  WHERE qbool.id = qscore.id AND t.id = qbool.id
  ORDER BY score DESC
  LIMIT :max_rows;
  """
  stmt = text(sql).bindparams(ngrams=ng, max_rows=limit)
  rv = []
  for row in run_statement(stmt, True, False):
    pk = str(row[0])
    name = str(row[1])
    score = float(row[2]/len(ng))
    d = {}
    (d["pk"], d["name"], d["score"]) = (pk, name, '{:.4f}'.format(score))
    rv.append(d)
  return Response(json.dumps(rv), status=200, mimetype="application/json")

There’s a fair bit going on here, mostly within the SQL expression. I’ll do my best to narrate that:

  • There are two common table expressions (CTEs), each handling a different aspect of the problem.
  • qbool handles the boolean part: is this row a match or not? It also provides an additional scoring input, delta, the difference in length between the query’s n-gram array and the row’s grams array. This is used to, for example, boost the score of the row containing “LA Galaxy” relative to the row containing “LA Galaxy II”.
  • The query predicate, WHERE grams && CAST(:ngrams AS TEXT[]), incorporates the && (array overlap) operator. Ideally, this operation would use the GIN index; the pull request addressing that has been merged and is currently available on main, so it will be included in a future release – stay tuned.
  • qscore uses the results of qbool to compute a score based on the number of overlapping n-grams between the matching row and the query.
  • Finally, the results of these CTEs are combined with the name column from the teams table to produce an ordered result, returned to the client in JSON format (see the example above).
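
To make the scoring concrete, here is the arithmetic behind the top hit in the earlier example. Lowercased, both “PA Galuxy” and “LA Galaxy” yield 7 trigrams, three of which ('a g', ' ga', 'gal') overlap:

delta = 1 + abs(7 - 7)       # 1: both n-gram arrays have 7 elements
n = 3                        # overlapping trigrams: 'a g', ' ga', 'gal'
score_sql = 100 * n / delta  # 300, the value computed in the SQL
score = score_sql / 7        # 42.857..., after dividing by len(ng) in do_search()

For “LA Galaxy II”, which yields 10 trigrams, the same three trigrams overlap, but delta becomes 1 + abs(10 - 7) = 4, so its score drops to 100 * 3 / 4 / 7 ≈ 10.7143: the length penalty at work.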

How to do fuzzy matching with CockroachDB

Here are some notes on how to replicate what I’ve shown above. Since I’ve already done this, it’s possible I’ll leave out a step, so please let me know if so (a GitHub issue is probably the simplest way). I’m going to illustrate this using an Ubuntu VM I’ve just deployed in Google Cloud Platform.

  • Step 1: Install some prerequisites (I like to use the psql CLI):

sudo apt update
sudo apt install postgresql-client-common -y
sudo apt install postgresql-client -y
sudo apt install python3-pip -y
sudo apt install libpq-dev -y
pip3 install -r requirements.txt
  • Step 2: Deploy the latest CockroachDB binary (I’ll use the beta since 22.1 hasn’t yet shipped):

$ curl https://binaries.cockroachdb.com/cockroach-v22.1.0-beta.1.linux-amd64.tgz | tar xzvf -
  • Step 3: Start “demo” mode (keep this terminal open to the CLI):

$ ./cockroach-v22.1.0-beta.1.linux-amd64/cockroach demo

In the output, you’ll see a line like the following, which you’ll use in a couple of contexts:

#     (sql)      
postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require
  • Step 4: In a new shell, clone this GitHub repo:

$ git clone https://github.com/mgoddard/hot-fuzz.git
  • Step 5: Make that repo your current working directory (the VM’s hostname is also hot-fuzz):

$ cd hot-fuzz/
$ ls
LICENSE  README.md  images  prep_teams_data.pl  pretty_print_json.py  trigrams.py
  • Step 6: Start up the Python Flask REST app:

$ export DB_CONN_STR="postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require"
$ ./trigrams.py
  • Step 7: Back in the terminal with the CLI, run the DDL, enable and then set up the changefeed:

demo@127.0.0.1:26257/movr> CREATE TABLE teams
(
  id UUID PRIMARY KEY DEFAULT GEN_RANDOM_UUID()
  , name TEXT NOT NULL
  , grams TEXT[]
  , FAMILY f1 (id, name)
  , FAMILY f2 (grams)
);
CREATE INDEX ON teams USING GIN (grams);

demo@127.0.0.1:26257/movr> SET CLUSTER SETTING kv.rangefeed.enabled = TRUE;

demo@127.0.0.1:26257/movr> CREATE CHANGEFEED FOR TABLE teams FAMILY "f1"
INTO 'webhook-https://localhost:18080/cdc?insecure_tls_skip_verify=true'
WITH updated, full_table_name, topic_in_value;
  • Step 8: In a third terminal, load the data (the Perl script is included in this repo):

$ cd hot-fuzz/
$ DB_CONN_STR="postgresql://demo:demo15932@127.0.0.1:26257/movr?sslmode=require"
$ curl -s https://en.wikipedia.org/wiki/List_of_professional_sports_teams_in_the_United_States_and_Canada | ./prep_teams_data.pl | psql $DB_CONN_STR
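
I haven’t walked through prep_teams_data.pl, but given the pipeline above, it extracts team names from the Wikipedia page’s HTML and writes SQL to stdout for psql to execute; conceptually, something like this (the actual statements the script emits may differ):

INSERT INTO teams (name) VALUES ('LA Galaxy');
INSERT INTO teams (name) VALUES ('LA Galaxy II');
INSERT INTO teams (name) VALUES ('Tampa Tarpons');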
  • Step 9: Finally, try out that /search endpoint:

$ name="PA Galuxy"; time curl -k -s https://localhost:18080/search/$( echo -n $name | base64 )/5 | ./pretty_print_json.py
[
  {
    "name": "LA Galaxy",
    "pk": "73b541c1-e3d4-4f1d-8598-bedc11200f03",
    "score": "42.8571"
  },
  {
    "name": "LA Galaxy II",
    "pk": "3a3b0e44-68cc-45da-aa72-9467e46f0458",
    "score": "10.7143"
  },
  {
    "name": "LA Galaxy II",
    "pk": "9303e054-d6ad-4793-b904-9e3eeba28ff9",
    "score": "10.7143"
  },
  {
    "name": "LA Giltinis",
    "pk": "9b8bfbb4-1ac6-477d-aaed-79daf1183f02",
    "score": "4.7619"
  },
  {
    "name": "Tampa Mayhem",
    "pk": "f566f12f-7f8d-4ae6-b04b-832a5f9025c5",
    "score": "3.5714"
  }
]

real 0m0.049s
user 0m0.052s
sys 0m0.008s
  • That’s it!

Acknowledgements

The author wishes to thank the following individuals for providing valuable input which made this blog post possible:

  • Dan Kelly, for mentioning the tweet which inspired the activity focused on n-grams
  • @schrepfler, for tagging @CockroachDB in the tweet about n-grams and fuzzy matching
  • Aaron Zinger, for a heads up about the upcoming changefeed support for column families
  • Rebecca Taft, for pushing on the PR for index acceleration of the && operator
  • Rajiv Sharma, for providing that PR and for adding the final polish to it yesterday, so it could be merged
  • Jordan Lewis, for taking up the trigram cause in the core of CockroachDB itself
